1: Introduction To The Data


In [1]:
import pandas as pd
academy = pd.read_csv('academy_awards.csv',encoding = 'ISO-8859-1')
academy.head(5)


Out[1]:
Year Category Nominee Additional Info Won? Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 2010 (83rd) Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} NO NaN NaN NaN NaN NaN NaN
1 2010 (83rd) Actor -- Leading Role Jeff Bridges True Grit {'Rooster Cogburn'} NO NaN NaN NaN NaN NaN NaN
2 2010 (83rd) Actor -- Leading Role Jesse Eisenberg The Social Network {'Mark Zuckerberg'} NO NaN NaN NaN NaN NaN NaN
3 2010 (83rd) Actor -- Leading Role Colin Firth The King's Speech {'King George VI'} YES NaN NaN NaN NaN NaN NaN
4 2010 (83rd) Actor -- Leading Role James Franco 127 Hours {'Aron Ralston'} NO NaN NaN NaN NaN NaN NaN

2: Filtering The Data


In [2]:
academy['Year'] = academy['Year'].str[:4].astype('int64')
academy.head(1)


Out[2]:
Year Category Nominee Additional Info Won? Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 2010 Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} NO NaN NaN NaN NaN NaN NaN

In [3]:
later_than_2000 = academy[academy['Year']>2000]
later_than_2000['Year'].unique()


Out[3]:
array([2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001])

In [4]:
award_categories = ['Actor -- Leading Role','Actor -- Supporting Role','Actress -- Leading Role','Actress -- Supporting Role']
nominations = later_than_2000[later_than_2000["Category"].isin(award_categories)]
nominations['Category'].unique()


Out[4]:
array(['Actor -- Leading Role', 'Actor -- Supporting Role',
       'Actress -- Leading Role', 'Actress -- Supporting Role'], dtype=object)

3: Cleaning Up The Won? And Unnamed Columns


In [5]:
replacements = { "NO": 0, "YES": 1 }
nominations["Won?"] = nominations["Won?"].map(replacements)
nominations["Won"] = nominations["Won?"]
drop_cols = ["Won?","Unnamed: 5", "Unnamed: 6","Unnamed: 7", "Unnamed: 8", "Unnamed: 9", "Unnamed: 10"]
final_nominations = nominations.drop(drop_cols, axis=1)


/dataquest/system/env/python3/lib/python3.4/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app

4: Cleaning Up The Additional Info Column


In [ ]:
additional_info_one = final_nominations["Additional Info"].str.rstrip("'}")
additional_info_two = additional_info_one.str.split(" {'")
movie_names = additional_info_two.str[0]
characters = additional_info_two.str[1]
final_nominations["Movie"] = movie_names
final_nominations["Character"] = characters
final_nominations = final_nominations.drop("Additional Info", axis=1)
final_nominations

5: Exporting To SQLite


In [ ]:
import sqlite3
conn = sqlite3.connect("nominations.db")
final_nominations.to_sql("nominations", conn, index=False)

In [ ]:
query_one = "pragma table_info(nominations);"
query_two = "select * from nominations limit 10;"
print(conn.execute(query_one).fetchall())
print(conn.execute(query_two).fetchall())
conn.close()